

Residual Distillation: Towards Portable Deep Neural Networks without Shortcuts

Neural Information Processing Systems

By transferring both features and gradients across layers, the shortcut connections introduced by ResNets allow us to effectively train very deep neural networks of up to hundreds of layers. However, the additional computation costs induced by these shortcuts are often overlooked. For example, during online inference the shortcuts in ResNet-50 account for about 40 percent of the total memory used for feature maps, because features from preceding layers cannot be released until the subsequent computation is completed. In this work, for the first time, we consider training CNN models with shortcuts and deploying them without. In particular, we propose a novel joint-training framework that trains a plain CNN by leveraging the gradients of its ResNet counterpart.
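To illustrate the memory cost the abstract refers to, here is a minimal sketch (not the authors' code) contrasting a residual block with a plain block; `conv` is a hypothetical stand-in for a conv-BN-ReLU stage:

```python
import numpy as np

def conv(x):
    # Hypothetical stand-in for a conv-BN-ReLU stage.
    return x * 0.5 + 1.0

def residual_block(x):
    # The input x must be kept in memory until the addition below
    # completes; this retained feature map is the extra inference
    # cost that removing shortcuts avoids.
    return conv(conv(x)) + x

def plain_block(x):
    # No skip path: x can be freed as soon as the first conv
    # has consumed it.
    y = conv(x)
    return conv(y)

x = np.ones((4,))
print(residual_block(x))  # plain output plus the retained input
print(plain_block(x))
```

The two blocks differ only by the final addition, but that addition forces the runtime to hold the block's input alongside all intermediate activations, which is where the reported ~40 percent feature-map memory overhead comes from.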


Review for NeurIPS paper: Residual Distillation: Towards Portable Deep Neural Networks without Shortcuts

Neural Information Processing Systems

Weaknesses: * There are numerous approaches to reducing a ConvNet's memory footprint and computational cost at inference time, including but not limited to channel pruning, dynamic computation graphs, and model distillation. Why is removing the shortcut connections the best way to achieve this goal? The baselines considered in Tables 3 and 4 are rather lacking. For example, how does the proposed method compare to: 1. A pruning method that reduces the channel counts of ResNet-50 to match the memory footprint and FLOPs of plain-CNN 50. What would be the drop in accuracy?


Review for NeurIPS paper: Residual Distillation: Towards Portable Deep Neural Networks without Shortcuts

Neural Information Processing Systems

The new training scheme follows the teacher-student paradigm to obtain results comparable to those of a ResNet model, but without residual connections (shortcuts). Results are on par with the SOTA and the approach is very interesting, although not necessarily very novel in principle (I encourage the authors to make this much clearer in the final text). All reviewers agree that this is a good contribution and that the rebuttal was helpful in reaching the final conclusion.

